Like a rolling stone: Colonization and migration dynamics of the gray reef shark (Carcharhinus amblyrhynchos)

Abstract Designing appropriate management plans requires knowledge of both the dispersal ability and what has shaped the current distribution of the species under consideration. Here, we investigated the evolutionary history of the endangered gray reef shark (Carcharhinus amblyrhynchos) across its range by sequencing thousands of RADseq loci in 173 individuals in the Indo‐Pacific (IP). We first bring evidence of the occurrence of a range expansion (RE) originating close to the Indo‐Australian Archipelago (IAA) where two stepping‐stone waves (east and westward) colonized almost the entire IP. Coalescent modeling additionally highlighted a homogenous connectivity (Nm ~ 10 per generation) throughout the range, and isolation by distance model suggested the absence of barriers to dispersal despite the affinity of C. amblyrhynchos to coral reefs. This coincides with long‐distance swims previously recorded, suggesting that the strong genetic structure at the IP scale (F ST ~ 0.56 between its ends) is the consequence of its broad current distribution and organization in a large number of demes. Our results strongly suggest that management plans for the gray reef shark should be designed on a range‐wide rather than a local scale due to its continuous genetic structure. We further contrasted these results with those obtained previously for the sympatric but strictly lagoon‐associated Carcharhinus melanopterus, known for its restricted dispersal ability. Carcharhinus melanopterus exhibits a similar RE dynamic but is characterized by a stronger genetic structure and a nonhomogeneous connectivity largely dependent on local coral reefs availability. This sheds new light on shark evolution, emphasizing the roles of IAA as source of biodiversity and of life‐history traits in shaping the extent of genetic structure and diversity.


Comparison of site frequency spectrum using different assembly and variant calling pipelines
To empirically investigate the influence of low coverage on variant calling, we investigated the site frequency spectrum (SFS) reconstructed in the Bampton sampling site (N=10) using four assembly and variant calling pipelines: (1) S1: This pipeline is based on Stacks v.1. 48 (Catchen et al., 2013). We implemented the same assembly parameters as those applied with Stacks v.2.5 and detailed in the main text (namely, m=3, n=3, and N=3). Stacks v.1.48 uses the calling algorithm of (Lynch, 2009) which requires high coverage data for accurate genotype inference (Rochette et al., 2019).
Using a custom R script, we filtered: (i) SNPs heterozygotes in more than 80% of the sample; (ii) loci with coverage higher than the mean coverage plus twice the standard deviation; (iii) SNPs in the last 5bp of the assembled locus; and (iv) loci containing more than five SNPs, after visual inspection of the distribution of segregating sites per locus.
(2) PY: This pipeline uses the assembly algorithm implemented in PyRAD (Eaton, 2014). We applied the same parameters of Walsh et al., (2022), in order to compare their results to ours. The clustering threshold (level of similarity between sequences to be considered homologous) was set to 0.9, reads with more than 5 low quality bases were discarded, the minimum read depth for base calling was set to 6 and the maximum to 1000. The calling algorithm of PyRAD is the same as Stacks v.1.48. Loci with more than 5 SNP and sites with higher heterozygosity than 0.5 were also discarded using PyRAD pipeline. Using a custom R script, we additionally filtered depth by retaining only sites in the 90% core of the distribution of depth following Walsh et al., (2022).
(3) S2: This pipeline is based on Stacks v.2.5 (Rochette et al., 2019) and mainly differs from S1 and PY in the calling algorithm. Stacks v.2.5 implements the population-based bayesian framework of (Maruki and Lynch, 2017) for variant calling, which is supposed to be more accurate for low coverage data (Rochette et al., 2019). The assembly step and filters were performed similarly to S1 above.
(4) AN: This pipeline is the one we used in the main text for all sampling sites. It is based on a first assembly of a pseudo-reference sequence (as in (Khimoun et al., 2020;Heller et al., 2021) against which raw reads are mapped back, before using ANGSD (Korneliussen et al., 2014) for the genotype free allele frequency estimation. This pipeline has been previously described and successfully applied to low-coverage RAD-seq data ( Heller et al., 2021;Lesturgie, Planes, et al., 2022) and it is detailed in the main text.
We retained only loci with no missing data (monomorphic loci were used to properly scale the genetic diversity). The folded SFS was then computed by using a custom R script except for the folded SFS produced through the AN pipeline which was directly inferred using the RealSFS program implemented within the ANGSD framework. We computed the normalized SFS as in (Lapierre et al., 2017) to compare the distribution of alleles frequency between the four pipelines.

The expectation of the normalized SFS is a horizontal line in a panmictic population of constant
Ne (the standard coalescent model). The normalized SFS allows an immediate and qualitative description of the excess or deficit of low frequency variants compared to the standard coalescent model. We then inferred the variation of effective size (Ne) through time by modelling the SFS with the stairwayplot software (Liu and Fu, 2020). To be correctly scaled, the stairwayplot needs the total number of sites without missing data (monomorphic sites included) which were either directly extracted from the variant calling output (AN, S1 and S2) or estimated from the missing data rate detected in variant sites and the total number of sites assembled (PY). To compare the inferred stairwayplot with Walsh et al., (2022), we used their same generation time of 16.4 years and mutation rate μ=1.9434e-08 per site per generation. This mutation rate was taken from (Maisano Delser et al., 2016), who estimated it based on the exon capture data of the black tip reef shark C. melanopterus. This value was later adjusted to represent a true genomic average, since exon capture represent a genomic sample enriched in conserved regions (Lesturgie, Lainé, et al., 2022;Lesturgie, Planes, et al., 2022). Therefore, in the main text we used the corrected mutation rate of 1.93e-8 per site per generation and a generation time of 10 years as in (Lesturgie, Planes, et al., 2022).

References
Catchen J, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013). Stacks: an analysis tool set for population genomics. Mol Ecol 22: 3124-3140.  Figure S1. Correlation map between genetic diversity (θπ) and Least Cost (LC) distances when considering all sampling sites. Each cell is coloured according to the correlation coefficient value computed between θπ and the LC distance from the putative origin of the range expansion (RE). Black dots represent the sampling sites considered.